-
-
Notifications
You must be signed in to change notification settings - Fork 373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stepsize no longer reset to 1 if term_buffer = 0 (Issue #3023) #3027
base: develop
Are you sure you want to change the base?
Conversation
@funko-unko want to verify this fixes the problem? To get a copy of cmdstan with this patch do: git clone --recursive https://github.com/stan-dev/cmdstan cmdstan-issue-3023
cd cmstan-issue-3023/stan
git checkout bugfix/issue-3023
cd ../
make build |
Jenkins Console Log Machine informationProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010CPU: G++: Clang: |
I'll try it out later. |
Changing the return code means that stepsize adaptation does not restart after the last metric window. The last window is usually longer than I think it would be more conservative to simply change the restart value of stan/src/stan/mcmc/stepsize_adaptation.hpp Lines 49 to 53 in 37ad8c3
The first call to learn_stepsize() overwrites x_bar_ anyway so the initial value here is irrelevant unless you call complete_adaptation() immediately without learn_stepsize() in between (i.e. term_buffer = 0 ), in which case the stepsize becomes exp(x_bar_) .The sampler sets mu_ = log(10*stepsize) right before restart() so the natural restart value for x_bar_ would bemu_ - log(10) .
|
Is the trick that: x_bar_ = (1.0 - x_eta) * x_bar_ + x_eta * x; 1.0 - x_eta is always zero on the first iteration cause counter is always 1? I think the annoying thing here is that we deal with the magic constant 10 in two places. I wouldn't mind getting rid of the 10x and restarting at
I think this should still restart the step size adaptation if there is a term buffer following. At least that was the intention. If there is no term buffer, there is no reset. The logic to detect the end of the last window with a term buffer is: void compute_next_window() {
if (adapt_next_window_ == num_warmup_ - adapt_term_buffer_ - 1)
return; And the finished thing here is detecting instead: bool finished() {
if (adapt_window_counter_ + 1 >= num_warmup_) {
return true; |
Yes
No, not two.
(And that's just
Related to issue #2670 That 10x isn't the only janky thing about this. stan/src/stan/mcmc/hmc/base_hmc.hpp Line 103 in 37ad8c3
(and shouldn't that be abs(delta_H) > abs(std::log(0.8)) ?
I think the "correct" stepsize multiplier when changing windows is probably something like the largest eigenvalue of (old metric / new metric).
Ah, you're right. |
The current behavior is intended. After the final windowed adaptation update there is little guidance on the step size for the terminal buffer window. The step size is initialized to 1, which is roughly near the value for a multivariate normal target of low dimension if the metric has been adapted perfectly. More generally, however, the optimal step size for sampling purposes is not simply related to the eigenvalues of M^{-1} Covar(target) and hence the scaling in the optimal step size is not simply related the eigenvalues of M_{new}^{-1} M_{old}. The dual averaging adaptation sets mu to 10 times the initial epsilon to encourage more aggressive adaptation early on; this is a common heuristic for setting mu when optimization. The objective of the step size initialization is only to find a rough value and the precise adaptation target is not especially important. In particular the initialization considers the acceptance probability at one step which is not characteristic of the acceptance probability averaged over the entire trajectory (either uniformly weighted as is currently implemented or properly weighted as it should be implemented). Because there is not great way to predict a good step size after updating the inverse metric a non-zero terminal buffer window was always intended to be run. Given that users are running without a terminal buffer window a message warning users that the step size will not be adapted with |
I don't know whether I'm the only one doing this, but please don't issue a warning and please report the last used step size. Thanks. |
Well that doesn't mean we can't change it. I don't really see the importance of setting this to 1.0. When we continue adaptation we're setting it to 10x the old timestep anyway. If anyone wants to use the stepsize 1 for something, they can just use stepsize 1. Sure the adaptation might be bad, but the behavior is confusing, stepsize = 1 will probably be bad too, and this is an easy to fix. |
I agree with @bbbales2 & @funko-unko that despite it's not a practice we want to recommend if someone wants to set term_buffer=0 the outcome of stepsize=1 violates the least surprise principle. |
My use case is of course to use only the first two adaptation windows on a series of converging/contracting posteriors, and only for the last one to use |
I thought more about this because it feels like a multivariate normal target with perfect adaptation should be simple enough to solve analytically. Working with small stepsize approximation I got fairly strong bounds for single-step accept probability but of course that may not be indicative of longer trajectory or more complicated target distribution. |
@betanalpha Any further thoughts on this? I would like to make this change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To directly address #3023 , can we add a test to call service function like hmc_nuts_dense_e_adapt
with term_buffer=0
and check adapted stepsize?
Oh, right, I'll have a look whenever my current computation finishes... |
I added a direct test. It seems a little weird without context (like someone coming along might not understand why the stepsize should not be 1), so I'm down to add a comment about why it is there, but it serves the purpose. Btw don't approve or merge on this until @betanalpha okays it or we just disagree and vote on it. Thanks for the review though. |
Jenkins Console Log Machine informationProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010CPU: G++: Clang: |
Let me reiterate the intended behavior of the adaptation.
The init buffer gives the sampler time to find the typical set, adapting the step size along the way to facilitate the search.
The intermediate adaptation windows try to aggregate information about the global covariance structure. In these windows the inverse metric is not adapted dynamically, which is prone to unstable behavior, but the step size is. After each adaptation window the inverse metric is updated with the aggregated information and the step size is reset to a default near 1 _because there is no other decent guess_.
Because of this compartmentalization the intermediate windows _never leave the sampler configuration in a good state_. This isn’t a problem when applying another intermediate window because the step size is adapted within those windows, but it is when the final intermediate window has completed. The entire role of the terminal buffer is to learn a final stepwise and bring the sampler back into a good state. In other words term_buffer = 0 is not appropriate whenever there are intermediate adaptation windows.
term_buffer = 0 was included in case the adaption windows were turned off. In other words if one wanted to run only an initial buffer window to learn a step size but keep the inverse metric at its initial value (either the identify or whatever is specified externally).
A warning if term_buffer = 0 but window != 0, as well as additional documentation, is entirely appropriate here. Trying to hack the adaptation to allow fragile, unintended use cases is not.
I agree with @bbbales2 <https://github.com/bbbales2> & @funko-unko <https://github.com/funko-unko> that despite it's not practice a we want to recommend but if someone wants to set term_buffer=0 the outcome of stepsize=1 violates the least surprise principle <https://en.wikipedia.org/wiki/Principle_of_least_astonishment>.
The problem with the least surprise principle in general is that it’s subjective; surprise will always be defined by prior expectations which are not uniform across individuals.
For example here the problem isn’t in the behavior of the terminal adaptation buffer but rather in misconceptions about how the _intermediate_ adaptation windows work.
My use case is of course to use only the first two adaptation windows on a series of converging/contracting posteriors, and only for the last one to use term_buffer!=0, as described here https://discourse.mc-stan.org/t/fitting-ode-models-best-efficient-practices/21018/60?u=funko_unko <https://discourse.mc-stan.org/t/fitting-ode-models-best-efficient-practices/21018/60?u=funko_unko>Did you find any documentation explaining that this is a valid use case of the adaptation?
I thought more about this because it feels like a multivariate normal target with perfect adaptation should be simple enough to solve analytically. Working with small stepsize approximation I got fairly strong bounds for single-step accept probability but of course that may not be indicative of longer trajectory or more complicated target distribution.
Notes attached if someone wants to check my math.
The exact result can be worked out by generalizing the derivation in A.3 of https://arxiv.org/abs/1411.6669 to include non-identify covariance and/or metric.
Is the jump from || . ||_{-} to | | correct here? L^{-1} M^{-1} isn’t positive definite and so the distribution of acceptance probabilities shouldn’t be symmetric. Indeed even in the normal case the Hamiltonian error doesn’t oscillate symmetrically around zero but rather is biased towards larger energies.
In any case as you note the limitation here is that the result holds only for one short step, but that short step recovers the Langevin limit. The optimal Langevin acceptance probability and step size are not the same for Hamiltonian Monte Carlo with nontrivial integration times.
|
Actually, the stepsize does not reset to one between the windows. Instead the algorithm invokes
I'm only using the leading order epsilon^3 term for the single-step energy error. That term is linear in p and the distribution of p is certainly symmetric. The bias you note is of order epsilon^4 so it's negligible in this approximation. (The approximation may well be too crude to be interesting.) |
As far as I know the initial guess for the stepsize in the next interval is 10x the last one.
Nah nah I don't think it's a misconception -- I see what you're saying here (after a metric is updated, because the stepsize corresponds to the previous metric we'd want to update it). I don't think that changing the behavior for term_buffer = 0 is going to compromise anything. Like sure the sampler is in a dubious state, but the user requested term_buffer = 0 so, yeah, that's intentional, and they're following up with more adaptation where the stepsize can change (and doing a non-zero term-buffer at the end of it all).
I do like the idea of not doing the 10x, which seems pretty crude. @funko-unko how are you currently working around this? |
Oh yeah I guess -- is everyone here comfortable if we just try to do a vote on this? Like, the SGB rules say votes for major issues and this seems quite small, but I'm happy just manufacturing a decision and rolling with it rather than arguing back and forth too long. If I'm stopping the discussion too early just say so. Niko's timestep adaptation seems interesting re: the discussions here + the thing here so either way this PR goes we can still keep that it mind. |
It's not. See this code stan/src/stan/mcmc/hmc/nuts/adapt_diag_e_nuts.hpp Lines 36 to 38 in e064647
The call to init_stepsize() adjusts the stepsize between the intervals and set_mu does not set the initial guess but some kind of shrinkage target.
I don't think we're at the point where people agree to disagree. It looks more like people can't even agree what the current behaviour is. I made this plot with
|
Currently I'm just using the step sizes used for the last sample as extracted from the output csv to pass to the next warmup step. (Actually currently I compute the mean across chains, but really I haven't settled on anything yet).
No, do you think this is an invalid use case? Is there any metric that could answer that question? |
I'll have to read up on what the dual averaging thing does. I'm clueless here. Thanks for the plot.
So this is the stepsize pre-stepsize reset, right? So you're able to get the thing you need it's just kinda inconvenient? |
I double-checked and that's not supported either. Regardless of whether |
I think so, yes.
I wouldn't say I get what I need, as I do not know what I need. Using a stepsize of one did appear to work. Using the chainwise last step size appeared to work slightly better. Using the mean/median across chains appeared to work slightly better still. Using the maximal step size across chains worked well sometimes, badly other times. |
I double-checked and that's not supported either. Regardless of whether window is zero or not, term_buffer=0 causes the adaptive sampler to reset stepsize to 1. And window=0 also forces the metric to 0.001 times the identity matrix. This PR does not fix the metric reset issue.
There are two issues here.
One is the behavior of dual averaging for vanishing window sizes (the term_buffer = 0 problem). This can resolved in two ways, both of which require a second `restart` signature with an initial step size argument. The initial step size can either be used to set `x_bar_ = std::log(epsilon)` instead of zero, which changes the optimization behavior and will have to be tested externally, or it can be saved as a member variable with `complete_adaptation` updated to something like
```
void complete_adaptation(double& epsilon) {
if (counter_ > 0)
epsilon = initial_epsilon_;
}
```
which does not change the optimization behavior.
The other is that when `adapt_base_window_ = 0` the window termination code is called after every iteration between `adapt_init_buffer` and `adapt_term_buffer_` which continuously restarts the step size adaptation and tries to update the inverse metric. In some sense the behavior with `adapt_base_window = 0` and `adapt_init_buffer + adapt_term_buffer < n_warmup` is undefined, but having any iterations between `adapt_init_buffer` and `num_warmup - adapt_term_buffer` no op is probably the best option.
The quick fix here is changing `end_adaptation_window` mcmc/windowed_adaptation.hpp from
```
bool end_adaptation_window() {
return (adapt_window_counter_ == adapt_next_window_)
&& (adapt_window_counter_ != num_warmup_);
}
```
to
```
bool end_adaptation_window() {
return (adapt_window_counter_ == adapt_next_window_)
&& (adapt_window_counter_ != num_warmup_)
&& (adapt_base_window_ != 0);
}
```
and adding a warning to `set_window_params` like
“WARNING: Because base_window = 0 only a single step size adaptation will be performed across the all warmup transitions."
|
Not quite.
The initial value of x_eta = std::pow(counter_, -kappa_);
x_bar_ = (1.0 - x_eta) * x_bar_ + x_eta * x; The first update has I did notice a couple of places where the optimization behavior probably should be changed. (or at least doesn't make sense to me)
Alternatively this can be written as
The last term is identical to the adaptation scheme in section 3.2 of the Hoffman and Gelman (2014) NUTS paper. |
Not quite. adapt_next_window_ is not the first iteration of the next window but the last iteration of the current window. When adapt_base_window_ = 0 it's always the last iteration of the init buffer. There is only one reset if init_buffer > 0 and no reset at all if init_buffer = 0. In effect, the term buffer extends to fill the void left by the missing metric adaptation windows.
Technically when `adapt_base_window = 0` the next window is always set to the current iteration via `adapt_window_counter_`
```
adapt_next_window_ = adapt_window_counter_ + adapt_window_size_
= adapt_window_counter_ + 0
= adapt_window_counter_ ;
```
but `adapt_window_counter_` is incremented _after_ `compute_next_window()` is called so on the next iteration
```
adapt_window_counter_ == adapt_next_window_
== old_adapt_window_counter_
== adapt_window_counter_ - 1
```
will evaluate to false and the adaptation will not trigger an update. But regardless the outcome is the same — only one reset immediately after the `init_buffer` is completed.
The initial step size can either be used to set x_bar_ = std::log(epsilon) instead of zero, which changes the optimization behavior
The initial value of x_bar_ does not change the optimization behavior. learn_stepsize() updates x_bar_ according to
x_eta = std::pow(counter_, -kappa_);
x_bar_ = (
1.0 - x_eta) * x_bar_ + x_eta * x;
The first update has counter_ = 1. Regardless of the value of kappa_ the weight (1.0 - x_eta) is 0.0 and any finite initial value of x_bar_ is completely erased.
Sorry, yes, you are correct.
The answer is that for the initial window mu_ is set before the first init_stepsize() but for later windows mu_ resets after init_stepsize(). That seems inconsistent. The stepsize is not very good before init_stepsize().
Remember that the step size is adapted to the _average_ acceptance statistic across the entire parameter space. The step size initialization heuristic approximates this average by considering just one step and just the single initial point. If that point is far outside of the typical set, however, the local accept statistic won’t say much about the average acceptance statistic.
I believe the current asymmetry is intentional — setting the initial mu based on a user-specified stepsize instead of the behavior around the initial point which may be far outside of the typical set, and then set mu based on the step size initialization heuristic only after the init_buffer has completed and the sampler has a decent shot of reaching the typical set.
• This discussion has been concerned with the "surprising" result when term_buffer = 0. But let's take a quick look at term_buffer = 1. Of course the step size then depends on what the accept_stat__ is for that single adaptation transition but I think that if the accept_stat__ just happens to equal adapt_delta then the "least surprising" result is that the step size does not change. It might not be exactly optimal but there's no clear reason to move it in either direction, right?
The current behavior in that scenario is to set the step size to exp(mu_), i.e. 10 times the initial stepsize which feels a bit too large.
There's an easy fix: make restart() initialize s_bar_ = gamma*log(10) instead of the current s_bar_ = 0.
The most obvious side effect is removing the high spike right after every metric update.
Any average-based adaptation mechanism is going to be a bit unstable after only one iteration.
Primal dual-averaging is set up to shrink the adapted parameter towards mu, with gamma setting the strength of that shrinkage. After 1 iteration where the acceptance statistic is exactly delta defaulting to that shrinkage value is natural response. This appears aggressive only because mu is set to be aggressive; those spikes are the adaptation aggressively trying out larger values of epsilon near mu to see if they’re any better before settling down to a smaller value.
The problem with changing gamma is that it affects the shrinkage towards mu through the entire adaptation sequence, not just after a few iterations. If someone wants to run with a very small terminal buffer then they can set gamma to a very small value so that the shrinkage to mu is much weaker.
But wait, the multiplier (1-1/(t0+n))*sqrt(n/(n-1)) is larger than 1.0 when n < t0 so for the first ten iterations it pushes the stepsize away from mu rather than pulling towards mu. That can't be right, can it?
For n = 0 we have (1 - eta) = (1 - 1 / (t0 + n) ) = 1 - 1 / t0 < 1 because t0 > 0.
For n > 0 we have (1 - eta) = 1 - 1 / (t0 + n) which is also bounded between 0 and 1 because both t0 and n are positive.
|
Submission Checklist
./runTests.py src/test/unit
make cpplint
Summary
Fix for #3023
Side Effects
I changed how the return codes from
covar_adaptation::learn_covariance
andvar_adaptation::learn_variance
worked.Copyright and Licensing
Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Columbia University
By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses: